Intro

Below are:

  • A summary of key points that cropped up again and again.
  • A set of proposed readings by topic.

Note that these notes haven't yet been edited or extended using the audio recordings I made, and some need significantly more material. See the "Proposed readings by topic" section below or ask me for more details.

Summary

  • Both the financial and academic worlds are increasingly adopting Python for the same reasons.

    • They're often encumbered with extremely large, heterogeneous, legacy systems.
      • Old work isn't discarded. Incremental additions, re-use over interfaces.
      • Think COBOL/Excel/VBA for finance, FORTRAN/C in academia.
    • They're both seeking one paradigm as an end-to-end high-level solution for all users.
    • Neither can sacrifice performance, yet both are finding time-to-market and development lifecycles too long with their legacy systems.
    • Python has long had a reputation for being unable to deliver performance, but it is now commonly acknowledged that this is no longer true.
      • Python easily serves as glue for high-performance libraries such as NumPy and SciPy, file formats such as HDF5, and heterogeneous computation backends such as shared-memory parallelism (SMP, via OpenMP in Cython; a minimal sketch follows this list), GPUs (CUDA) and FPGAs (OpenCL).
      • C/C++-level performance can be achieved with significantly simpler code and designs, but it still requires sophisticated knowledge of memory cache hierarchies, disk I/O patterns, and SMP issues.
    • The financial industry has long relied on Python as an interface to core, high-performance components, but it is paranoid and extremely closed: in practice, the only way knowledge gets shared is when employees move between firms.
    • More information:
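
    To make the "OpenMP via Cython" point above concrete, here is a minimal sketch; the module and function names are my own invention, and it assumes a C compiler with OpenMP support.

      # parallel_sum.pyx -- hypothetical module illustrating SMP via
      # Cython's OpenMP support. Build with OpenMP enabled, e.g.
      #   CFLAGS=-fopenmp LDFLAGS=-fopenmp cythonize -i parallel_sum.pyx
      from cython.parallel import prange

      def parallel_sum(double[:] xs):
          """Sum a NumPy array across all cores with the GIL released."""
          cdef double total = 0.0
          cdef Py_ssize_t i
          # prange spreads iterations over OpenMP threads; Cython infers
          # that "total += ..." is a reduction.
          for i in prange(xs.shape[0], nogil=True):
              total += xs[i]
          return total
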
  • IPython Notebook is universally used and loved.

  • No clear future for visualisations in Python.

    • matplotlib is universally used for publication-quality charts. API is difficult to use but powerful and well engineered.
    • It's clear that web-based visualisations are the future, and very important even for publications.
    • It's also acknowledged that JavaScript, rather than static images, is the ideal way to reach the browser.
    • But how to reach the browser? There were many different perspectives:
      • IPython Notebook - use the %matplotlib inline magic incantation to draw charts; no interactivity, no JavaScript (a minimal example follows this list).
        • Other libraries built on top of matplotlib of course work just as well: ggplot, seaborn, prettyplotlib.
      • "04 - Visualisations in Bokeh": people love ggplot in R because of the Grammar of Graphics, and people love Python because it's a one-stop stop
        • So use Python with a ggplot-like grammar to auto-generate HTML5-canvas backed web visualisations using JavaScript.
        • HTML5-canvas is an investment and should reap rewards over SVG-based libraries like d3.js for very complex visualisations.
      • "12 - Getting it out there - Python-JS-web-viz": forget Python, just code front-end in JavaScript and defer back-end and data cleaning to Python.
        • d3.js, nvd3, crossfilter, rickshaw, ...
      • Lightning talks
        • One presenter used Python over a WebSocket bridge to Angular.js to create an R Shiny-style interactive chart environment.
        • Another presenter showed off IPython version 2 functionality (coming end of April 2014): interactive widgets that dynamically recreate charts based on user input.
    • There is no clear conclusion, except that matplotlib is fantastic work and has stood the test of time.
    • Bokeh seems very exciting but rough around the edges, with a large and difficult-to-install set of dependencies; the tutorials are worth exploring in full (which I plan to do, writing the results up in a new article).
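
    For reference, the matplotlib route above is this simple in a notebook cell, with the chart arriving as a static inline image:

      %matplotlib inline

      import numpy as np
      import matplotlib.pyplot as plt

      # A static, publication-quality chart: no JavaScript, no interactivity.
      x = np.linspace(0, 2 * np.pi, 200)
      plt.plot(x, np.sin(x), label="sin(x)")
      plt.legend()
      plt.title("Rendered inline as a static image")
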
  • Cython is almost universally used, but more agile methods are being sought.
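
    As an example of how low the barrier now is, the Cython cell magic compiles on the fly inside the notebook with no separate build step (load it first with %load_ext cythonmagic, or %load_ext Cython on newer Cython releases):

      %%cython
      # Typed memoryview plus typed loop variables: the usual cheap win
      # over a pure-Python loop.
      def csum(double[:] xs):
          cdef double total = 0.0
          cdef Py_ssize_t i
          for i in range(xs.shape[0]):
              total += xs[i]
          return total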

  • Everyone uses scikit-learn.
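
    Part of the appeal is the uniform estimator interface: everything exposes fit/predict, so swapping models barely changes the surrounding code. A minimal example:

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression

      iris = load_iris()
      # Only this line changes when trying a different model.
      model = LogisticRegression().fit(iris.data, iris.target)
      print(model.predict(iris.data[:5]))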

  • MapReduce/clusters have less hype and traction than you'd expect.

    • Certainly there are some who use it for their data processing pipeline, e.g. "07 - Hierarchical Text Clustering in Python and Hive".
    • Given a large data set that cannot fit onto one disk, prefer to create large RDBMS clusters. See:
    • Given a large data set that cannot fit into memory, prefer to use e.g. HDF5 to disk-back it, or create additional abstractions on top of NumPy/HDF5, à la "23 - Manipulating massive disk-backed arrays" (a minimal h5py sketch follows this list).
    • scikit-learn core contributors strongly prefer shared-memory parallelism to clusters, and are actively creating OpenMP-style abstractions (with better debugging and NumPy array performance).
    • Fantastic lightning talk in which the presenter used FireDrake to easily switch the computing backend from SMP to GPU, but again no mention of clusters.
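
    As a sketch of the disk-backed approach above, here is the HDF5 route via h5py; the file and dataset names are made up, and the sizes are kept small enough to actually run. The point is that a dataset handle behaves like a NumPy array, but only the slices you touch are read into memory:

      import numpy as np
      import h5py

      n, chunk = 10**7, 10**6

      # Write the array to disk chunk by chunk, never holding it all in RAM.
      with h5py.File("big.h5", "w") as f:
          d = f.create_dataset("xs", shape=(n,), dtype="f8", chunks=True)
          for i in range(0, n, chunk):
              d[i:i + chunk] = np.random.rand(chunk)

      # Read it back the same way: slicing the dataset handle loads only
      # that slice into memory.
      with h5py.File("big.h5", "r") as f:
          xs = f["xs"]
          total = sum(float(np.sum(xs[i:i + chunk]))
                      for i in range(0, n, chunk))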

Proposed readings by topic

Culture, industry background

Case studies

Technical - software engineering

Technical - mathematical

Technical - other

Visualisations

